Forecasting techniques have received considerable interest from researchers
because of the unique characteristics of businesses and their influence on
several areas of the economy. Most academics use the autoregressive integrated
moving average (ARIMA) approach for forecasting. However, researchers face
challenges such as analyzing the data and selecting appropriate ARIMA
parameters, especially with large datasets. This study investigates the use of
the automatic ARIMA (Auto ARIMA) function for forecasting Brent oil prices. It
demonstrates the benefits of using Auto ARIMA over manual ARIMA for determining
the appropriate ARIMA parameters based on measures such as root mean square
error (RMSE), mean absolute error (MAE), and the Akaike information criterion
(AIC). Because it bypasses several steps needed for manual ARIMA, it does not
require the attention of an expert data scientist. Auto ARIMA produced an RMSE
of 12.5539 and an AIC of 1877.224, which are comparable to the values obtained
from manual ARIMA with the help of expert data scientists; thus, it saves
analysis time while still yielding the best model result.

1. INTRODUCTION

Today’s crude oil prices have a tremendous influence on the global economy and security as oil is 
one of the world’s primary energy sources. Because crude oil accounts for a large proportion of certain 
countries’ exports, a rapid shift in price can have severe economic consequences, with crude oil price drops 
resulting in decreased economic activity [1]. Markets are becoming more competitive, especially following 
science and technology’s rapid advancement, which forces businesses to offer a variety of high-quality items 
for customers while remaining cost-effective [2]. One of the most common methods used for
forecasting in various fields is the autoregressive integrated moving average (ARIMA) method, a linear time 
series forecasting approach used in finance, engineering, social sciences, and agriculture, among other fields 
[3], [4]. The ARIMA model is the result of combining autoregressive (AR) and moving average (MA) 
models. ARIMA has three parameters (p, d, and q), where p represents the order of the autoregressive terms, d
refers to the number of non-seasonal differences, and q denotes the order of lagged forecast errors in the forecast equation
[5], [6]. ARIMA models can accurately forecast relatively steady time series data. However, they assume that
future data values are linearly related to present and historical data values. As a result, ARIMA may struggle with
many real-life time series, which exhibit complicated non-linear, seasonal, and non-stationary patterns that it cannot
adequately capture [7], [8]. Several studies on crude oil time series forecasting have been conducted in recent
years [1], [9]-[12]. A problem common to these studies is that determining the three ARIMA parameters is not a
simple task and requires a data science specialist, especially when autocorrelation and partial
autocorrelation plots must be interpreted for a huge dataset. Furthermore, a unit root test such as the Dickey-Fuller test is needed to
check whether the data are stationary before non-stationary data are converted to stationary data by differencing. This
analysis takes time and requires an expert, and because it may be difficult for an expert to deal with big data, the findings may occasionally
be inaccurate [13]. Nevertheless, accurate time series forecasting is critical as it assists in future planning and
decision-making and is the basis for greater resource utilization and service levels [14]. The Akaike information
criterion (AIC) is a statistical metric that
can assess the relative quality of different models by comparing the goodness of fit of each model with that 
of other models [15]-[17]. Thus, the contributions of this paper lie in demonstrating an automated process
called Auto ARIMA, how best to use it, and what its advantages are over
traditional ARIMA, namely that it is faster and can fit the model directly after the data preprocessing step
based on the value of the AIC. The paper is organized as follows: section 1 introduces the topic; section 2
discusses related work; section 3 presents the methodology of time series forecasting, with section 3.1
describing the dataset, section 3.2 discussing traditional ARIMA, and section 3.3 examining Auto ARIMA; section 4
presents the experimental results for traditional ARIMA and Auto ARIMA; and section 5 concludes the study.


2. RELATED WORK 

Time series forecasts are becoming increasingly important in business worldwide [18]. Previous authors have
performed several studies on the ARIMA
algorithm for forecasting. For example, Siregar et al. [2] used SAS software and the ARIMA method to
predict raw material requirements for plastic products based on income data. The mean absolute percentage error
(MAPE) was used to assess
the accuracy of the predicted outcomes. Forecasting for 2015 using ARIMA (3,0,2) on sales
data for plastic products between 2012 and 2014 showed an increase in forecast accuracy. The research’s
weakness is that forecasting requires a human expert to analyze the complex data, and the model is not
designed for long-term prediction; as a result, long-horizon forecasts will likely be flat or constant.

Sahinli [3] implemented an ARIMA model to predict consumer potato prices. The study found the
ARIMA (1,1,2) estimation to be the best model because it had the lowest criterion values, including a MAPE 
of 116.9075, a root mean square error (RMSE) of 201.759, and a mean absolute deviation (MAD) of 
176.896. 

Banerjee et al. [13] applied the ARIMA model (1,0,1) to predict India’s approximate future stock 
market prices using data collected over the six years prior to the study. The results showed a root mean 
square (RMS) of 691.399, a mean absolute percentage error (MAPE) of 3.334, and a mean absolute error 
(MAE) of 506.210. The drawback of this study’s approach is that it assumed that the dataset was linear, 
which may not have been the case. As a result, the method is unsuitable for non-linear systems.

Fattah et al. [18] used the Box-Jenkins time series approach and developed an ARIMA model to
forecast finished-product demand in a food factory. The most suitable model was chosen
based on four criteria: AIC, standard error, Schwarz Bayesian criterion (SBC), and maximum likelihood.
ARIMA (1,0,1) was chosen because it met all four criteria. The results show that the model
can predict future food demand. The limitation of this work is that the ARIMA model was designed to
function with stationary data; hence, when non-stationary data are used, ARIMA provides low accuracy.

Ohyver and Pudjihastuti [19] proposed an ARIMA (1,1,2) model for forecasting rice prices, which
was found to have good accuracy for medium-quality rice (RMSE=14.22316, AIC=4645.1), while ARIMA 
(2,0,2) had an RMSE of 45.53879 and an AIC of 5984.69. The drawback of the study is that ARIMA is only 
suitable for short-term forecasting. Therefore, it cannot be used for long-range forecasting. 

Bandyopadhyay et al. [20] used an ARIMA (1,1,1) model to predict future gold prices based on
historical gold prices traded over the previous 10 years. ARIMA (1,1,1) was selected from six
candidate parameter sets because it was the most effective and met all of the fit statistics criteria, whereas the
other five did not. ARIMA (1,1,1) provided an RMS of 719.18, a MAPE of 3.245, and an MAE of 477.330. 
The challenge of the study’s dataset was that under economic instability or certain government policies, it 
becomes impossible to record precise changes in gold prices, making the model ineffectual for forecasting in 
that situation. Furthermore, the approach depends on the linearity of the historical data, yet there is no proof 
that gold prices are linear. 

Borucka [21] suggested an ARIMA (1,0,3) model based on the concept of a relationship between
the values of a time series at one moment and their values at previous moments. The model accurately
predicted the number of road accidents. However, the study’s
weakness was that identification, which involves determining optimal values for the model parameters, is
challenging to carry out for the ARIMA model.


High performance time series models using auto autoregressive integrated moving ... (Redha Ali Al-Qazzaz) 


424 ISSN: 2502-4752


Kumar and Vanajakshi [22] attempted to address the issue described above by offering a short-term
traffic flow prediction strategy with limited input data using the seasonal ARIMA (SARIMA) model.
ARIMA (2,0,0) (0,1,1) displayed an AIC of 4,218.34, which is less than that of the other models. The preceding
studies show that there are several obstacles and limitations, but one in particular stands out: the difficulty of
determining the order of the p, d, and q parameters in the ARIMA model. This is the problem that this paper
addresses.


3. RESEARCH METHOD 

This section first introduces the dataset used in this study. It then reviews the fundamentals of the
ARIMA forecasting model. Finally, it describes the proposed model
used to predict future crude oil prices on international markets, focusing on the Auto ARIMA approach.


3.1. Dataset 

There are many crude oil markets around the world. The dataset used in this paper comes from the
US Energy Information Administration and can be accessed from data.nasdaq.com. The
CSV file contains only two fields: date and price. The data comprise daily historical Brent oil prices from 16 January 2016
through 31 December 2019. A sample of this dataset is shown in Table 1.


Table 1. Daily Brent crude oil prices 


Date Price 
4-Jan-16 36.28 
5-Jan-16 35.56 
6-Jan-16 33.89 
7-Jan-16 33.57 
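
As a minimal sketch of how rows in the format of Table 1 could be read, the snippet below parses the four sample rows with Python's standard library. The inline CSV text, the header names, and the names `SAMPLE_CSV` and `load_prices` are illustrative assumptions; the paper does not state how the file was loaded.

```python
import csv
import io
from datetime import date, datetime

# Sample rows copied from Table 1; the header names are assumed.
SAMPLE_CSV = """Date,Price
4-Jan-16,36.28
5-Jan-16,35.56
6-Jan-16,33.89
7-Jan-16,33.57
"""


def load_prices(text: str) -> list[tuple[date, float]]:
    """Parse 'd-Mon-yy' dates and prices, returned in chronological order."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        day = datetime.strptime(record["Date"], "%d-%b-%y").date()
        rows.append((day, float(record["Price"])))
    return sorted(rows)


prices = load_prices(SAMPLE_CSV)
print(prices[0])
```

Sorting by date keeps the observations in time order, which the later differencing and train/test steps rely on.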


3.2. ARIMA 

An ARIMA model combines autoregression and moving average components with differencing of the time series.
These models are fitted to time series data either to better understand the data or to forecast future points
in the series [19]. This approach is also called the Box-Jenkins method. If the data are found not to be
stationary, they are made stationary using the differencing technique. The ARIMA approach uses three parameters:
p, d, and q. The p parameter indicates the number of lag periods [23], [24]. For example, if
p=2 is used in the autoregressive component of the equation, the two preceding periods of the time series are
employed. Parameter d represents the number of differencing transformations performed to eliminate trends
and/or seasonality, thereby changing the time series into a stationary one (keeping the mean and variance
constant across time) [25]. This is a crucial step in preparing the data for an ARIMA model. The lag of the
error component of the ARIMA model is represented by parameter q [26]. The error component is the part of
the time series that cannot be explained by trend or seasonality [9]. The model can be written as (1).


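$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2} - \dots - \theta_q \varepsilon_{t-q} \qquad (1)$$


where $\phi_i$ is the coefficient of the autoregressive model; $\theta_i$ is the coefficient of the moving average model;
$y_t$ is the value on the current day, $y_{t-1}$ is the value on the previous day, and $y_{t-2}$ is the value two days prior;
$\varepsilon_t$ is the error term; and $c$ is the constant.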
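
To make (1) concrete, the following sketch evaluates its right-hand side for given coefficients and history. The function name `arma_value` and the ARMA(1,1) coefficients are hypothetical and chosen only for illustration; setting the current error term to zero gives a one-step forecast, since the future error is unknown.

```python
def arma_value(c, phi, theta, past_y, past_eps, eps_t=0.0):
    """Evaluate the right-hand side of equation (1).

    past_y[0] is y_{t-1}, past_y[1] is y_{t-2}, and so on;
    past_eps uses the same ordering for the error terms.
    """
    if len(past_y) < len(phi) or len(past_eps) < len(theta):
        raise ValueError("not enough history for the requested orders")
    ar_part = sum(coef * y for coef, y in zip(phi, past_y))
    ma_part = sum(coef * e for coef, e in zip(theta, past_eps))
    # The moving average terms enter with a minus sign, as in (1).
    return c + ar_part + eps_t - ma_part


# Hypothetical ARMA(1,1) coefficients, not estimates from the paper.
forecast = arma_value(c=1.0, phi=[0.5], theta=[0.2], past_y=[10.0], past_eps=[1.0])
print(round(forecast, 4))  # 1.0 + 0.5*10.0 - 0.2*1.0 = 5.8
```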

The Dickey-Fuller test is used to examine whether time series data are stationary [27]. It is a
statistical significance test of a null hypothesis. The null
hypothesis, in this case, is that the time series is non-stationary. If the test’s p-value is greater than 0.05,
the null hypothesis cannot be rejected, which indicates that the data are non-stationary.
Otherwise, the null hypothesis is rejected and the data are considered stationary [28]. If the time series data
are non-stationary, the differencing procedure must be performed. First, the previous day’s value is subtracted
from the current day’s value as a differencing operation. Next, the Dickey-Fuller test is repeated to
determine whether the differenced series is still non-stationary. If so, this process of differencing and
testing is repeated until the time series becomes stationary [29]. Furthermore, the
autocorrelation function (ACF) and partial autocorrelation function (PACF) are statistical tools for
assessing how closely values in a time series are correlated with their own lagged values, a crucial stage in
ARIMA implementation. ACF
and PACF plots are used to estimate the input parameters for the ARIMA model [23], [30].
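
The differencing operation described above can be sketched in a few lines. The helper name `difference` is an assumption made for this illustration; the sample values are the four prices listed in Table 1.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: each pass maps y_t to y_t - y_{t-1}."""
    values = list(series)
    for _ in range(d):
        values = [curr - prev for prev, curr in zip(values, values[1:])]
    return values


sample = [36.28, 35.56, 33.89, 33.57]  # prices from Table 1
first_diff = difference(sample)
print([round(v, 2) for v in first_diff])  # [-0.72, -1.67, -0.32]
```

Each pass shortens the series by one observation, so a series differenced d times has d fewer points than the original.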

The values of p and q may be determined from ACF and PACF charts using some rules. For
example, if the ACF chart displays a sharp fall in autocorrelation at lag k but the PACF chart shows a smoother decrease,
then p should be zero and q should be set to k (the most significant lag value). On the other hand, if the
partial autocorrelation drops sharply at lag k but the ACF chart indicates a smoother reduction,
then q should be 0 and p should be set to k. The general ARIMA model algorithm is presented in
Algorithm 1.
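
The significance check behind these rules can be sketched as follows. The snippet computes the sample ACF and flags lags that fall outside the usual large-sample 95% band of 1.96 divided by the square root of n. The function names and the toy series are assumptions for illustration; the PACF is not computed here, and plotting libraries may draw slightly different bands.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelation r_k for k = 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


def significant_lags(series, max_lag):
    """Lags whose |r_k| exceeds the approximate 95% band 1.96 / sqrt(n)."""
    bound = 1.96 / math.sqrt(len(series))
    acf = sample_acf(series, max_lag)
    return [k for k in range(1, max_lag + 1) if abs(acf[k]) > bound]


# Toy alternating series, used only to exercise the functions.
toy = [1.0, -1.0] * 20
print(sample_acf(toy, 2))
print(significant_lags(toy, 3))
```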


Algorithm 1: ARIMA
Input: Crude oil dataset time series
Output: Array of forecasted values
Begin:
#Load the time series data
Step 1: dataset ← {x1, x2, ..., xn}
#Preprocess the data
Step 2: Remove all columns except the price column
#Identification: determine whether or not the time series data are stationary
Step 3: series ← dataset
result ← p-value of the Dickey-Fuller test on series
counter ← 0
While result > 0.05 do
series ← differencing(series) #current day - previous day
result ← p-value of the Dickey-Fuller test on series
counter ← counter + 1
end
#Assign the d value
Step 4: d ← counter
Step 5: Plot ACF and PACF
Step 6: Determine the order of p by observing the PACF and q by observing the ACF
#Fit the ARIMA model
Step 7: result_model ← ARIMA (p, d, q)
#Forecast values on the validation set
Step 8: forecasted_array ← result_model.predict(start date to end date)
#Check the model’s performance by calculating errors
Step 9: error ← forecasted_array - actual_array
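
Steps 3 and 4 of Algorithm 1 can be sketched as a loop that differences the series until a stationarity test passes. In this sketch the test is injected as a function returning a p-value, so that any unit root test can be plugged in (for example, a Dickey-Fuller implementation from a statistics library). The names `choose_d` and `stub_p_value` and the `max_d` cap are assumptions of this illustration, and the stub is a placeholder, not a statistical test.

```python
def difference_once(series):
    """First-order differencing: current value minus previous value."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


def choose_d(series, p_value_fn, alpha=0.05, max_d=2):
    """Difference until p_value_fn reports stationarity (p-value <= alpha).

    max_d is a safety cap added for this sketch; it is not part of Algorithm 1.
    """
    current = list(series)
    d = 0
    while p_value_fn(current) > alpha and d < max_d:
        current = difference_once(current)
        d += 1
    return d, current


def stub_p_value(series):
    """Placeholder: reports a high p-value when the series has a net change."""
    return 0.50 if abs(series[-1] - series[0]) > 1e-9 else 0.01


d, stationary = choose_d([1.0, 2.0, 3.0, 4.0, 5.0], stub_p_value)
print(d, stationary)  # 1 [1.0, 1.0, 1.0, 1.0]
```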


The last three steps are the core of the forecasting procedure. First, the model is fitted, so that the predicted
values depend on the known past values of the variable or of other associated variables. Once the model is found
to be appropriate in the analysis, the most suitable model can be used for future forecasts.

Finally, the future values are estimated and the model’s performance is checked by calculating
errors from the predictions and the actual values on the validation set. The difference between an actual and a
predicted value is a forecast error. MAE and RMSE, given in (2) and (3), are the metrics used most
often for prediction accuracy [13], [25], [31].


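$$\mathrm{MAE} = \frac{1}{h} \sum_{t=m+1}^{m+h} \left| y_t - \hat{y}_t \right| \qquad (2)$$


$$\mathrm{RMSE} = \sqrt{\frac{1}{h} \sum_{t=m+1}^{m+h} \left( y_t - \hat{y}_t \right)^2} \qquad (3)$$

where $y_t$ is the actual value, $\hat{y}_t$ is the predicted value, $m$ is the number of training observations, and $h$ is the number of validation observations.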
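
For reference, the two error metrics in (2) and (3) can be computed as in the sketch below. The function names and the three-point example are illustrative assumptions and are not values from the paper.

```python
import math


def mae(actual, predicted):
    """Mean absolute error over paired actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error over paired actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 10.0]
print(mae(actual, predicted), rmse(actual, predicted))
```

Because RMSE squares each error before averaging, it penalizes the single large miss in this example more heavily than MAE does.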


3.3. Auto ARIMA 

In time series applications, many decision processes require high forecasting accuracy. ARIMA
models are robust tools for time series analysis, but the p, d, and q parameters must be chosen
through analysis [32]. Auto ARIMA saves time and tunes these parameters by iterating through candidate p, d, and q
values. It aids in selecting the best set of these parameters and integrating them into the ARIMA model [33].
The AIC is used to compare the candidate combinations: the best combination of p, d, and
q is the one with the lowest AIC value. The Auto ARIMA model helps avoid some of the steps in the ARIMA
modelling technique by offering the best combination and increasing the model’s performance. The
general Auto ARIMA algorithm is presented in Algorithm 2.

Comparing Algorithms 1 and 2, we notice that the advantage of Auto ARIMA over
ARIMA is that Auto ARIMA does not depend on manual visual observation of the ACF and PACF and does
not need the advice of an expert. Thus, it skips stages 3-5. As shown in Figure 1,
ARIMA has nine steps, while Auto ARIMA has only five steps.

Algorithm 2: Auto ARIMA
Input: Crude oil dataset time series
Output: Array of forecasted values
Begin:
#Load the time series data
Step 1: dataset ← {x1, x2, ..., xn}
#Preprocess the data
Step 2: Remove all columns except the price column
#Use the Auto ARIMA model; p, d, and q are selected automatically
Step 3: result_model ← Auto ARIMA (p, d, q)
#Forecast values on the validation set
Step 4: forecasted_array ← result_model.predict(start date to end date)
#Check the model’s performance by calculating errors
Step 5: error ← forecasted_array - actual_array
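
The selection idea behind Step 3 of Algorithm 2 can be sketched as a search that keeps the candidate order with the lowest finite AIC. This is a simplified exhaustive grid, not the stepwise search shown in the trace of Figure 5, and it is not the implementation of any particular library. The model-fitting function is injected; here it is replaced by a stub that returns the AIC values of the "intercept" rows reported in Figure 5. The names `aic`, `select_order`, and `REPORTED_AIC` are assumptions of this illustration.

```python
import math
from itertools import product


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def select_order(fit_aic, p_values, d_values, q_values):
    """Return (order, aic) with the lowest finite AIC over the grid.

    fit_aic(order) is expected to fit ARIMA(order) and return its AIC.
    """
    best_order, best_aic = None, math.inf
    for order in product(p_values, d_values, q_values):
        score = fit_aic(order)
        if math.isfinite(score) and score < best_aic:
            best_order, best_aic = order, score
    return best_order, best_aic


# Stub standing in for real model fitting: AIC values of the
# "intercept" rows in the Figure 5 trace (inf marks a failed fit).
REPORTED_AIC = {
    (0, 0, 0): 5121.293,
    (1, 0, 0): math.inf,
    (0, 0, 1): 4232.578,
    (1, 0, 1): 1877.224,
}

order, score = select_order(REPORTED_AIC.__getitem__, [0, 1], [0], [0, 1])
print(order, score)  # (1, 0, 1) 1877.224
```

Candidates whose AIC is infinite are skipped, which mirrors how failed fits in the trace never become the best model.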


[Figure 1: side-by-side flowcharts. ARIMA model steps: load dataset; preprocessing; identification (whether the data are stationary or not); estimation (parameters from ACF and PACF); diagnostic checking (model selection of p, d, q); forecasting (future prediction); predict values on validation set; calculate RMS. Auto ARIMA steps: load dataset; preprocessing; forecasting (future prediction); predict values on validation set; calculate RMS.]


Figure 1. ARIMA and auto ARIMA steps 


4. RESULTS AND DISCUSSION 

This section analyzes the performance of the traditional ARIMA and Auto ARIMA models using
historical data from the Brent crude oil market. The original dataset was split into two parts: 70
percent was used as a training set for manual ARIMA and for Auto ARIMA parameter estimation, and the
remaining 30 percent was used as a test set to assess the performance of the models. The results, analyses,
and comparisons between models were based on measures such as RMSE, MAE, and AIC.
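
A 70/30 split of this kind could look like the sketch below. It assumes the split is chronological (the first 70 percent of observations for training, the rest for testing), which is customary for time series but is not stated explicitly in the paper; the function name `chronological_split` is also an assumption.

```python
def chronological_split(series, train_percent=70):
    """Split a time-ordered sequence into leading train and trailing test parts."""
    if not 0 < train_percent < 100:
        raise ValueError("train_percent must be between 1 and 99")
    cut = len(series) * train_percent // 100  # integer arithmetic avoids float rounding
    return series[:cut], series[cut:]


train, test = chronological_split(list(range(10)))
print(len(train), len(test))  # 7 3
```

Keeping the test set strictly after the training set prevents the model from being evaluated on observations that precede its training data.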


4.1. ARIMA model result 

This research was carried out in a series of steps based on the Box-Jenkins approach. Following
Algorithm 1, after the first two steps were implemented, the third step determined whether the time
series data were stationary by using the Dickey-Fuller test, because the ARIMA model only works
with stationary time series data. The first Dickey-Fuller test produced a p-value of 0.4306. A p-value>0.05
signifies that the data are not stationary. If the data are non-stationary, they must be transformed using the
difference operation. Step 5 uses ACF and PACF plots to determine the orders of the p and q
parameters. The order of p can be found in the PACF plot, while q can be found in the ACF plot. Figure 2
shows that the ACF has one significant lag, and the other lags are not significant at the 0.05 level (i.e., the first lag is
higher than the others and extends beyond the confidence area shown in blue). Thus, q
is inferred to be one.

The same logic applies to the PACF, and its results are comparable to those
of the ACF. There is just one significant lag while the rest are not significant, so p is determined to be 1, as
illustrated in Figure 3.

The difference between the ACF and PACF is that, instead of identifying correlations between the present
time and its lags as the ACF does, the PACF looks for correlations between the residuals (i.e., what remains after
removing the effects of the previous lags) and the next lag value. Data analysis experts sometimes try
multiple ARIMA models based on the orders of p and q inferred from the ACF and
PACF plots. For example, on the above dataset, analysts test ARIMA (1,0,0), then ARIMA (0,0,1), and
so on, determining the error after each until they achieve a minimum error, as shown in Table 2.

When the three parameters (p, d, and q) were identified, the best ARIMA model was determined to
be ARIMA (1,0,1). As shown in Figure 4, the green line of ARIMA (1,0,1) passes through most of the
points. ARIMA (1,0,1) also has the lowest errors (MAE and RMSE), as further shown in Table 2.


[Figure 2: autocorrelation plot]


Figure 2. Autocorrelation function


[Figure 3: partial autocorrelation plot; x-axis: lag]


Figure 3. Partial autocorrelation


Table 2. Accuracy measures of the ARIMA model attempts

Methods          MAE      RMSE
ARIMA (1, 0, 0)  11.4462  13.0222
ARIMA (0, 0, 1)  18.6858  19.8968
ARIMA (1, 0, 1)  10.9729  12.5539


[Figure 4: time series plot with a date axis]


Figure 4. The green line represents the model 


4.2. Auto ARIMA result

Auto ARIMA does not require the user to run the Dickey-Fuller test to determine whether the data are
stationary, nor does it require ACF and PACF plots to establish p and q. Instead, Auto ARIMA performs the three
phases of the manual inspection procedure automatically. Figure 5 shows that Auto ARIMA performed the analysis and
provided parameters p, d, and q comparable to those provided by experts in manual ARIMA. It chose
ARIMA (1,0,1) as the best model because it had the lowest AIC in the trace, 1877.224. The
performance of the models was then examined by calculating errors using the predictions and actual values
from the validation set. The best fit model was estimated based on the minimal values of MAE, RMSE, and
AIC, as shown in Table 3. Therefore, the ARIMA model (1,0,1) was most suitable for this dataset.


Performing stepwise search to minimize aic
ARIMA(0,0,0)(0,0,0)[0] : AIC=7773.871, Time=0.02 sec
ARIMA(1,0,0)(0,0,0)[0] : AIC=inf, Time=0.17 sec
ARIMA(0,0,1)(0,0,0)[0] : AIC=inf, Time=0.18 sec
ARIMA(1,0,1)(0,0,0)[0] : AIC=1879.000, Time=0.33 sec
ARIMA(1,0,1)(0,0,0)[0] intercept : AIC=1877.224, Time=0.40 sec
ARIMA(0,0,1)(0,0,0)[0] intercept : AIC=4232.578, Time=0.19 sec
ARIMA(1,0,0)(0,0,0)[0] intercept : AIC=inf, Time=0.19 sec
ARIMA(0,0,0)(0,0,0)[0] intercept : AIC=5121.293, Time=0.03 sec

Best model: ARIMA(1,0,1)(0,0,0)[0] intercept
Total fit time: 1.513 seconds


Figure 5. Trace of auto ARIMA 


Of all the steps of manual ARIMA, steps 3, 4, and 6 were shortened while the results remained
the same. This eliminates the need for professionals when analyzing the data and selecting parameter values.
Relying only on manual visual inspection and expert-recommended guidelines may suffice for a
small univariate time series dataset, such as the dataset above. However, in the case of enormous datasets,
e.g., from IoT sensors, manual visual inspection would not be possible or would take a tremendous amount of
time. One solution is to employ Auto ARIMA, which automatically selects the ARIMA
parameters to choose the best model.


Table 3. The accuracy of the forecasting model 


Methods MAE RMSE AIC 
ARIMA (1,0,1) 10.9729 12.5539 1877.224 
ARIMA (1,0,0) 11.4462 13.0222 INF 
ARIMA (0,0,1) 18.6858 19.8968 4232.578 


5. CONCLUSION 

This study revealed that the most challenging stage of the traditional ARIMA model is
identification, the process of determining optimal values for the model parameters, which requires the
participation of experts. Therefore, Auto ARIMA is recommended for beginner data scientists in forecasting
because it shortens the most difficult and tedious stage of the dataset analysis. It selects the
best combination of parameters by using the AIC to compare models and choosing the one with the lowest value.
This saves a substantial amount of time and eliminates the need to
understand the statistics and theory underlying the model selection. Moreover, this strategy reduces the risk
of human error and of errors produced by incorrect interpretation of the results.